library(MASS)
head(mammals)
## body brain
## Arctic fox 3.385 44.5
## Owl monkey 0.480 15.5
## Mountain beaver 1.350 8.1
## Cow 465.000 423.0
## Grey wolf 36.330 119.5
## Goat 27.660 115.0
ggplot(mammals, aes(x=body,y=brain))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Body Weight", y="Brain Weight", title="Brain Weight Against Body Weight of Land Mammals")
## `geom_smooth()` using formula = 'y ~ x'
Generally speaking, there seems to be an increasing association between body weight and brain weight. The relationship appears to be more logarithmic than linear, but it is hard to tell because a few extreme observations make the scatterplot difficult to interpret.
To assess assumption 1, the data points should be evenly scattered on both sides of the regression line as we move from left to right. We do not see this in the scatterplot, so assumption 1 is not met. When body weight is between 1000 and 6000, the data point(s) are above the line; when body weight is above 6000, the data point(s) are below the line.
To assess assumption 2, the vertical spread of the data points should be constant as we move from left to right. The spread seems to be increasing as we move from left to right (or in other words, the spread is increasing as the response increases), so assumption 2 is not met.
mammalslm<-lm(brain~body, data=mammals)
par(mfrow = c(2, 2))
plot(mammalslm)
The first plot (top left) is the residual plot, with residuals on the y-axis and fitted values on the x-axis. The residual plot can be used to address assumptions 1 and 2. A red line is overlaid to represent the average value of the residuals across the range of fitted values. This line should lie along the x-axis without any apparent curvature if the form of our model is reasonable. That is not what we see; there is a clear curved pattern, so assumption 1 is not met. For assumption 2, we want the vertical spread of the residuals to be fairly constant as we move from left to right. We do not see this in the residual plot; the vertical spread increases as we move from left to right, so assumption 2 is not met.
The second plot (top right) is the normal probability plot (also called a QQ plot), and addresses assumption 4. If the residuals are normal, they should fall along the 45-degree line. The regression model is fairly robust to this assumption, though; normality is the least crucial of the four. There are clearly very influential outliers that cause the QQ plot to deviate at the tails, so assumption 4 may not be met.
The third plot (bottom left) plots the square root of the absolute value of the standardized residuals against the fitted values (scale-location). This plot assesses assumption 2, the constant variance assumption. A red line is overlaid to represent the average value on the vertical axis across the range of fitted values. If the variance is constant, the red line should be horizontal and the vertical spread of the plot should be constant. It is clear that assumption 2 is not met, which tells a similar story to the first plot.
The last plot (bottom right) is used to identify influential outliers. Data points that lie beyond the contour lines for large Cook’s distance are influential. Two of our data points have Cook’s distance greater than 1 and are therefore flagged as influential.
Because assumptions 1 and 2 (errors have mean 0 and errors have constant variance) are both violated, we first transform the response variable to stabilize the variance; once the variance is fixed, we can address the remaining non-linearity (assumption 1) by transforming the predictor variable.
boxcox(mammalslm)
boxcox(mammalslm, lambda = seq(-0.5, 0.5, 1/10))
Because 0 lies in the confidence interval (CI) for λ, we can choose λ = 0, which log transforms the response variable to get y∗ = log(y). This is an ideal solution because log transformations are interpretable and the Box-Cox procedure supports this choice.
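As a side check, `boxcox()` also returns the grid of λ values and their log-likelihoods, so the maximizing λ can be extracted programmatically (a small sketch; assumes `MASS` is loaded and `mammalslm` is the model fit below):

```r
library(MASS)
mammalslm <- lm(brain ~ body, data = mammals)
# boxcox() returns list(x = lambda grid, y = log-likelihoods);
# plotit = FALSE suppresses the plot and just returns the values.
bc <- boxcox(mammalslm, plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat  # the profile-likelihood estimate of lambda
```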
logbrain<-log(mammals$brain)
mammals2<-data.frame(mammals,logbrain)
ggplot(mammals2, aes(x=body,y=logbrain))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Body Weight", y="Log Brain Weight", title="Log Brain Weight Against Body Weight of Land Mammals")
## `geom_smooth()` using formula = 'y ~ x'
The relationship between y∗ and x does not appear to be linear. This plot still looks logarithmic in shape, so assumption 1 certainly appears to be violated. Assumption 2 is also not perfect, but I think transforming the x variable will help with that.
mammals2.ystar<-lm(logbrain~body, data=mammals2)
par(mfrow = c(2, 2))
plot(mammals2.ystar)
The variance assumption seems to be much improved, but the mean of errors does not appear to be 0 (they deviate from the red line in the residual vs. fitted plot). For this reason, assumption 1 seems to be violated.
Yes. Because assumption 1 is violated, we will need to transform the x variable as well. Looking at the scatterplot, a log() transformation seems best. I perform that transformation below:
logbody<-log(mammals2$body)
mammals2<-data.frame(mammals2,logbody)
ggplot(mammals2, aes(x=logbody,y=logbrain))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Log Body Weight", y="Log Brain Weight", title="Log Brain Weight Against Log Body Weight of Land Mammals")
## `geom_smooth()` using formula = 'y ~ x'
Now all of the assumptions appear to be reasonably met! The points seem strongly and positively linearly associated, with relatively constant variance.
mammals2.stars<-lm(logbrain~logbody, data=mammals2)
par(mfrow = c(2, 2))
plot(mammals2.stars)
Now all the assumptions seem to be satisfied! No further transformations are needed. It is interesting that the influential data points from the original model are no longer influential.
summary(mammals2.stars)
##
## Call:
## lm(formula = logbrain ~ logbody, data = mammals2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71550 -0.49228 -0.06162 0.43597 1.94829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.13479 0.09604 22.23 <2e-16 ***
## logbody 0.75169 0.02846 26.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6943 on 60 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
## F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16
The regression equation is \(\log(y) = 2.13479 + 0.75169\log(x)\) where y = land mammal brain weight and x = land mammal body weight.
\(\beta_1\) = 0.75169. Since both variables were log transformed, this slope shows that for a 1% increase in body weight, the weight of the brain increases by approximately 0.75169%.
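A quick numeric sanity check of this interpretation, using the fitted coefficients copied from the summary above (the baseline body weight of 100 is arbitrary):

```r
b0 <- 2.13479
b1 <- 0.75169
# Predicted brain weight at body weight 100 vs. 101 (a 1% increase),
# back-transformed from the log scale:
ratio <- exp(b0 + b1 * log(101)) / exp(b0 + b1 * log(100))
round(100 * (ratio - 1), 3)  # ~0.75% increase in predicted brain weight
```

Because the intercept cancels in the ratio, the percent change depends only on the slope: the ratio equals \(1.01^{0.75169}\).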
library(faraway)
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:survival':
##
## rats, solder
## The following object is masked from 'package:GGally':
##
## happy
head(cornnit)
## yield nitrogen
## 1 115 0
## 2 128 75
## 3 136 150
## 4 135 300
## 5 97 0
## 6 150 75
ggplot(cornnit, aes(x=nitrogen,y=yield))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Nitrogen Fertilizer Applied (pounds/acre)", y="Corn Yield (bushels/acre)", title="Corn Yield against Nitrogen Fertilizer Applied")
## `geom_smooth()` using formula = 'y ~ x'
Generally speaking, there seems to be an increasing association between the number of bushels per acre of corn yielded and the pounds per acre of nitrogen fertilizer applied. The relationship appears to be more logarithmic than linear.
To assess assumption 1, the data points should be evenly scattered on both sides of the regression line and follow the linear shape of the line as we move from left to right. We do not see this in the scatterplot, so assumption 1 is not met. When nitrogen fertilizer applied is below 50 and above 250, the data point(s) are below the line. When nitrogen fertilizer applied is between 50 and 200, the majority of the data point(s) are above the line. Assumption 1 is not met.
To assess assumption 2, the vertical spread of the data points should be constant as we move from left to right. The spread seems to be inconsistent, so assumption 2 is not met either.
cornlm<-lm(yield~nitrogen, data=cornnit)
par(mfrow = c(2, 2))
plot(cornlm)
The residuals vs. fitted and scale-location plots seem to show variation with a bit of fanning toward the right-hand side, so assumption 2 is not met and that assumption should be fixed before we adjust to fix assumption 1. In other words, we should transform the y-variable to create constant variation before adjusting the x-variable to ensure the errors have mean 0.
boxcox(cornlm, lambda = seq(1, 4.5, 1/10))
This Box-Cox plot shows that a transformation is certainly needed for the y-variable (corn yield) because 1 is not contained within the upper and lower bounds of the CI for \(\lambda\). We can choose λ = 2 to square the response variable, giving y∗ = \(y^2\).
cornlm2<-lm((yield)^2~nitrogen, data=cornnit)
par(mfrow = c(2, 2))
plot(cornlm2)
ggplot(cornnit, aes(x=nitrogen,y=(yield^2)))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Nitrogen Fertilizer Applied (pounds/acre)", y="Squared Corn Yield (bushels/acre)^2", title="Squared Corn Yield against Nitrogen Fertilizer Applied")
## `geom_smooth()` using formula = 'y ~ x'
I first transformed the y variable (corn yield) by squaring it in order to create more constant variance, as indicated by the Box-Cox plot. This caused improvement, but looking at the scatterplot it is clear that assumption 1 is still not met. Since the points seem to follow a \(\log(x)\) or \(\sqrt{x}\) shape, I first considered log transforming the x variable (nitrogen fertilizer applied). But because many x values equal 0, and the log of 0 is undefined, I took the square root of x instead.
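The zero-handling issue can be seen directly in base R; shifting by 1 before logging is another common workaround (shown here only for comparison, not what was fit below):

```r
log(0)      # -Inf: a plain log transform is undefined at zero nitrogen
sqrt(0)     # 0: the square root handles zero cleanly
log(0 + 1)  # 0: shifting by 1 before logging is an alternative fix
```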
cornlm3<-lm((yield)^2~sqrt(nitrogen), data=cornnit)
par(mfrow = c(2, 2))
plot(cornlm3)
ggplot(cornnit, aes(x=sqrt(nitrogen),y=(yield^2)))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Square Root of Nitrogen Fertilizer Applied sqrt(pounds/acre)", y="Squared Corn Yield (bushels/acre)^2", title="Squared Corn Yield against Square Root of Nitrogen Fertilizer Applied")
## `geom_smooth()` using formula = 'y ~ x'
Although this scatterplot looks a bit odd with the large gap from 0 to 6 in sqrt(nitrogen fertilizer applied), this regression now fits all the assumptions relatively well.
summary(cornlm3)
##
## Call:
## lm(formula = (yield)^2 ~ sqrt(nitrogen), data = cornnit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7497.3 -1951.6 8.3 2107.3 7160.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9706.27 993.28 9.772 2.22e-12 ***
## sqrt(nitrogen) 803.07 97.68 8.222 2.75e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3674 on 42 degrees of freedom
## Multiple R-squared: 0.6168, Adjusted R-squared: 0.6076
## F-statistic: 67.6 on 1 and 42 DF, p-value: 2.75e-10
The regression equation is \(y^2 = 9706.27 + 803.07\sqrt{x}\) with y = corn yield (bushels/acre) and x = nitrogen fertilizer applied (pounds/acre).
y: Games won (14-game season)
x1: Rushing yards (season)
x2: Passing yards (season)
x3: Punting average (yards/punt)
x4: Field goal percentage (FGs made/FGs attempted)
x5: Turnover differential (turnovers acquired - turnovers lost)
x6: Penalty yards (season)
x7: Percent rushing (rushing plays/total plays)
x8: Opponents’ rushing yards (season)
x9: Opponents’ passing yards (season)
nfl <- read.table("nfl.txt", header=TRUE)
head(nfl)
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## 1 10 2113 1985 38.9 64.7 4 868 59.7 2205 1917
## 2 11 2003 2855 38.8 61.3 3 615 55.0 2096 1575
## 3 11 2957 1737 40.1 60.0 14 914 65.6 1847 2175
## 4 13 2285 2905 41.6 45.3 -4 957 61.4 1903 2476
## 5 10 2971 1666 39.2 53.8 15 836 66.1 1457 1866
## 6 11 2309 2927 39.7 74.1 8 786 61.0 1848 2339
#GGally::ggpairs(nfl)
nfl.1 <- lm(y~x2+x7+x8, data=nfl)
summary(nfl.1)
##
## Call:
## lm(formula = y ~ x2 + x7 + x8, data = nfl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0370 -0.7129 -0.2043 1.1101 3.7049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.808372 7.900859 -0.229 0.820899
## x2 0.003598 0.000695 5.177 2.66e-05 ***
## x7 0.193960 0.088233 2.198 0.037815 *
## x8 -0.004816 0.001277 -3.771 0.000938 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.706 on 24 degrees of freedom
## Multiple R-squared: 0.7863, Adjusted R-squared: 0.7596
## F-statistic: 29.44 on 3 and 24 DF, p-value: 3.273e-08
Our estimated regression equation is \(Y = -1.808372 + 0.003598(X2) + 0.193960(X7) - 0.004816(X8)\) where Y = Games won, X2 = Passing yards, X7 = Percent rushing, and X8 = Opponents’ rushing yards.
summary(nfl.1)$coefficients[3,1]
## [1] 0.1939602
The number of games won in a 14-game season (y) increases by 0.1939602 for a one-unit increase in the percentage of rushing plays (x7), holding the team’s passing yardage (x2) and the opponents’ rushing yards (x8) constant.
newdata<-data.frame(x2=2000, x7=48, x8=2350)
predict(nfl.1, newdata, level=0.95, interval="prediction")
## fit lwr upr
## 1 3.381448 -0.5163727 7.279268
The estimate for the number of games won in a 14-game season for a team with a passing yardage of 2000 yards, 48% of rushing plays, and the opponents’ yards rushing equal to 2350 yards is 3.381448 games won.
We have 95% confidence that the number of games won in a 14-game season for a team with 2000 passing yards, 48% rushing plays, and 2350 rushing yards by the opponent is between -0.5163727 games and 7.279268 games.
summary(nfl.1)
##
## Call:
## lm(formula = y ~ x2 + x7 + x8, data = nfl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0370 -0.7129 -0.2043 1.1101 3.7049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.808372 7.900859 -0.229 0.820899
## x2 0.003598 0.000695 5.177 2.66e-05 ***
## x7 0.193960 0.088233 2.198 0.037815 *
## x8 -0.004816 0.001277 -3.771 0.000938 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.706 on 24 degrees of freedom
## Multiple R-squared: 0.7863, Adjusted R-squared: 0.7596
## F-statistic: 29.44 on 3 and 24 DF, p-value: 3.273e-08
\(H_0\) : \(\beta_1 = \beta_2 = \beta_3 = 0\) (None of the predictors is useful; we could drop all of them from the model.)
\(H_A\) : At least one of the \(\beta\) coefficients is not 0 (and therefore the corresponding predictor should not be dropped from the model)
F-statistic = 29.44 and p-value= \(3.273*10^{-8}\)
Because this F-statistic is large and the p-value of \(3.273*10^{-8}\) is smaller than \(\alpha=0.05\), we reject \(H_0\) in favor of \(H_A\). In other words, we have enough evidence to conclude that at least one of the predictors is useful. The data support the claim that the model with these three predictors is useful for predicting the number of wins during the 1976 season.
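The reported p-value can be reproduced from the F statistic and its degrees of freedom using the upper tail of the F distribution:

```r
# P(F > 29.44) with 3 numerator and 24 denominator degrees of freedom
pf(29.44, df1 = 3, df2 = 24, lower.tail = FALSE)
```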
summary(nfl.1)$coefficients[3,3]
## [1] 2.198262
The t-statistic for the percentage of rushing plays (x7) is 2.198262. Because it is relatively large and the p-value for x7 is smaller than \(\alpha=0.05\), we have enough evidence to conclude that this variable is significant for predicting the number of games won in a 14-game season (y) in the presence of the other predictor variables.
par(mfrow = c(2, 2))
plot(nfl.1)
All of the assumptions appear to be met. This is clear because the residuals are scattered randomly around 0 in the residual plot with constant variance.
nfl.2 <- lm(y~x1+x2+x7+x8, data=nfl)
summary(nfl.2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x7 + x8, data = nfl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7456 -0.6801 -0.1941 1.1033 3.7580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8791718 8.1955007 -0.107 0.91550
## x1 0.0009045 0.0016489 0.549 0.58862
## x2 0.0035214 0.0007191 4.897 6.02e-05 ***
## x7 0.1437590 0.1280424 1.123 0.27313
## x8 -0.0046994 0.0013131 -3.579 0.00159 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.732 on 23 degrees of freedom
## Multiple R-squared: 0.7891, Adjusted R-squared: 0.7524
## F-statistic: 21.51 on 4 and 23 DF, p-value: 1.702e-07
The t-statistic for x1, the team’s rushing yards for the season, is 0.549 with a p-value of 0.58862. Because the result of the t-test is not significant, we can drop the predictor x1, the team’s rushing yards for the season, while keeping the other predictors in the model.
I disagree with the classmate who said “Since the result of the t test is insignificant, the team’s rushing yards for the season is not linearly related to the number of wins.” An insignificant t test informs us we can drop that particular predictor, while leaving the other predictors in the model. It does not tell us anything about the linear relationship between the predictor and the response because there are other variables in the model whose interactions may be affecting its relationship with the response variable.
x1: Age. Age in years
x2: Weight. Weight in pounds
x3: HtShoes. Height with shoes in cm
x4: Ht. Height without shoes in cm
x5: Seated. Seated height in cm
x6: Arm. Arm length in cm
x7: Thigh. Thigh length in cm
x8: Leg. Lower leg length in cm
library(faraway)
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150
seatpos.full <- lm(hipcenter~., data=seatpos)
summary(seatpos.full)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
The ANOVA F-statistic is 7.94 with a small p-value, so we reject the null hypothesis and conclude that our MLR model with all of these predictors is useful. However, notice the t tests are insignificant for all of the predictor coefficients. Individually, each t test is informing us that we can drop that specific predictor while leaving the other predictors in the model.
Because almost all the t tests are insignificant, but the ANOVA F is highly significant, there is evidence of multicollinearity. Additionally, the standard errors for some of the estimated coefficients are very large, which further suggests the presence of multicollinearity.
round(cor(seatpos[,-9]),3)
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## Age 1.000 0.081 -0.079 -0.090 -0.170 0.360 0.091 -0.042
## Weight 0.081 1.000 0.828 0.829 0.776 0.698 0.573 0.784
## HtShoes -0.079 0.828 1.000 0.998 0.930 0.752 0.725 0.908
## Ht -0.090 0.829 0.998 1.000 0.928 0.752 0.735 0.910
## Seated -0.170 0.776 0.930 0.928 1.000 0.625 0.607 0.812
## Arm 0.360 0.698 0.752 0.752 0.625 1.000 0.671 0.754
## Thigh 0.091 0.573 0.725 0.735 0.607 0.671 1.000 0.650
## Leg -0.042 0.784 0.908 0.910 0.812 0.754 0.650 1.000
There are many variables that are highly correlated with one another! Nearly every correlation value shows moderate to high correlation. This is strong evidence of multicollinearity.
faraway::vif(seatpos.full)
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
Generally, VIFs greater than 5 indicate some degree of multicollinearity, and VIFs greater than 10 indicate a high level of multicollinearity. In the output above, we see VIF values as high as 333.137832 and 307.429378! These values, corresponding to Ht and HtShoes respectively, indicate very high levels of multicollinearity.
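As a reminder of where these numbers come from, the VIF for a predictor is \(1/(1 - R^2)\), where \(R^2\) comes from regressing that predictor on the others. A small sketch using the built-in mtcars data (rather than seatpos, so it needs no extra packages):

```r
# R^2 from regressing disp on the remaining predictors wt and hp
r2 <- summary(lm(disp ~ wt + hp, data = mtcars))$r.squared
vif.disp <- 1 / (1 - r2)
vif.disp  # values above 5 suggest disp is largely explained by wt and hp
```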
round(cor(seatpos[,-c(1,2,9)]),3)
## HtShoes Ht Seated Arm Thigh Leg
## HtShoes 1.000 0.998 0.930 0.752 0.725 0.908
## Ht 0.998 1.000 0.928 0.752 0.735 0.910
## Seated 0.930 0.928 1.000 0.625 0.607 0.812
## Arm 0.752 0.752 0.625 1.000 0.671 0.754
## Thigh 0.725 0.735 0.607 0.671 1.000 0.650
## Leg 0.908 0.910 0.812 0.754 0.650 1.000
All of the predictors that describe lengths of body parts (HtShoes, Ht, Seated, Arm, Thigh, and Leg) are moderately to highly positively linearly correlated. The largest correlation is between Ht and HtShoes.
Because the largest correlation is between Ht and HtShoes, I want to keep only one of these two predictors. Looking at how each is correlated with the other predictors, I see that Ht has slightly higher correlation values with each of the other variables than HtShoes does, so I plan to remove HtShoes, Seated, Arm, Thigh, and Leg and keep only Ht. Additionally, height is a fairly reliable measurement that can be more consistently measured than many of the other variables.
seatpos.red <- lm(hipcenter~Age+Weight+Ht, data=seatpos)
faraway::vif(seatpos.red)
## Age Weight Ht
## 1.093018 3.457681 3.463303
Because all the VIF values are now less than 5, the multicollinearity issue appears to be resolved.
Model 1: Age, Weight, HtShoes, Ht, Seated, Arm, Thigh, and Leg as the predictors for hipcenter
Model 2: Age, Weight, and Ht as the predictors for hipcenter
Carry out the appropriate hypothesis test to decide which of models 1 or 2 should be used. Be sure to show all steps in your hypothesis test.
\(H_0\) : \(\beta_{HtShoes} = \beta_{Seated} = \beta_{Arm} = \beta_{Thigh} = \beta_{Leg} = 0\) (The coefficients on the variables we wish to drop are 0, so the reduced model (model 2) is supported.)
\(H_A\) : At least one coefficient in \(H_0\) is not 0 (the full model (model 1) is supported.)
anova(seatpos.red, seatpos.full)
## Analysis of Variance Table
##
## Model 1: hipcenter ~ Age + Weight + Ht
## Model 2: hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh +
## Leg
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 45262
## 2 29 41262 5 4000.3 0.5623 0.7279
qf(0.95, 5, 29)
## [1] 2.545386
F-statistic = 0.5623
Critical value = 2.545386
P-Value = 0.7279
Because the F-statistic is smaller than the critical value, and the p-value is greater than \(\alpha=0.05\), we fail to reject \(H_0\). In other words, we do not have enough evidence to conclude that HtShoes, Seated, Arm, Thigh, and Leg are necessary to keep in the model in the presence of the other variables. We can drop HtShoes, Seated, Arm, Thigh, and Leg as they are not necessary in this model, so the reduced model (Model 2) is preferred.
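The partial F statistic can also be verified by hand from the RSS values in the ANOVA table:

```r
# F = [(RSS_reduced - RSS_full) / df_dropped] / [RSS_full / df_full]
F.stat <- ((45262 - 41262) / 5) / (41262 / 29)
round(F.stat, 4)  # matches the 0.5623 reported by anova()
```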
par(mfrow = c(2, 2))
plot(seatpos.red)
All of the assumptions appear to be met. This is clear because the residuals are scattered randomly around 0 in the residual plot with constant variance.
summary(seatpos.red)
##
## Call:
## lm(formula = hipcenter ~ Age + Weight + Ht, data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.526 -23.005 2.164 24.950 53.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 528.297729 135.312947 3.904 0.000426 ***
## Age 0.519504 0.408039 1.273 0.211593
## Weight 0.004271 0.311720 0.014 0.989149
## Ht -4.211905 0.999056 -4.216 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.49 on 34 degrees of freedom
## Multiple R-squared: 0.6562, Adjusted R-squared: 0.6258
## F-statistic: 21.63 on 3 and 34 DF, p-value: 5.125e-08
summary(seatpos.red)$r.squared
## [1] 0.6561654
Our estimated regression equation is \(hipcenter = 528.297729 + 0.519504(Age) + 0.004271(Weight) - 4.211905(Ht)\)
The \(R^2\) of this model is 0.6561654. The coefficient of determination informs us that about 65.62% of the variation in the horizontal distance of the midpoint of the hips from a fixed location in the car in mm (hipcenter) can be explained by the driver’s age, weight, and height. We notice that the \(R^2\) value for this reduced model is lower than the \(R^2\) for the full model, which was 0.6866. However, when looking at the adjusted \(R^2\) values, the full model has an adjusted \(R^2\) of 0.6001 while the reduced model has an adjusted \(R^2\) of 0.6258. \(R^2\) tends to increase with additional variables in the model, but after penalizing for the number of predictors, the adjusted \(R^2\) is lower in the full model than in the reduced model.
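The adjusted values can be reproduced with the usual formula \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), with n = 38 drivers:

```r
n <- 38
adj.full <- 1 - (1 - 0.6866)    * (n - 1) / (n - 8 - 1)  # 8 predictors
adj.red  <- 1 - (1 - 0.6561654) * (n - 1) / (n - 3 - 1)  # 3 predictors
round(c(full = adj.full, reduced = adj.red), 4)  # ~0.6001 and ~0.6258
```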
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
ggplot(penguins, aes(x=bill_depth_mm, y=body_mass_g))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Bill Depth (millimeters)", y="Body Mass (grams)", title="Body Mass Against Bill Depth of Penguins")
This plot seems to show a negative correlation between penguin body mass and bill depth overall. It is important to note that the points seem to be clustered in two very separate groups.
ggplot(penguins, aes(x=bill_depth_mm, y=body_mass_g, color=species))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, linewidth=0.5)+
labs(x="Bill Depth (millimeters)", y="Body Mass (grams)", title="Body Mass Against Bill Depth of Penguins by Species")
Now that the scatterplot is colored by species, it is clear that there is a positive linear correlation between body mass and bill depth for each species of penguin. Gentoo penguins seem to have larger body mass and shallower bill depths than the other species. The slopes are not exactly parallel, indicating that there may be an interaction between the species of penguin and its bill depth; the impact of bill depth on body mass differs among the species. So a regression model with interaction between species and bill depth may be appropriate.
contrasts(penguins$species)
## Chinstrap Gentoo
## Adelie 0 0
## Chinstrap 1 0
## Gentoo 0 1
penguins.int <- lm(body_mass_g~bill_depth_mm*species, data=penguins)
summary(penguins.int)
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm * species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -845.89 -254.74 -28.46 228.01 1161.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -283.28 437.94 -0.647 0.5182
## bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
## speciesChinstrap 247.06 829.77 0.298 0.7661
## speciesGentoo -175.71 658.43 -0.267 0.7897
## bill_depth_mm:speciesChinstrap -12.53 45.01 -0.278 0.7809
## bill_depth_mm:speciesGentoo 152.29 40.49 3.761 0.0002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 354.9 on 336 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
## F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
We note that Adelie penguins are the reference class, which we confirmed with the contrasts() function. Knowing this, we can interpret the estimated regression equation below:
The estimated regression equation is \(\hat y = -283.28 + 217.15(x) + 247.06(I_1) - 175.71(I_2) - 12.53(x \cdot I_1) + 152.29(x \cdot I_2)\) where y = body mass in grams, x = bill depth in millimeters, \(I_1\) = 1 if the species is Chinstrap and 0 otherwise, and \(I_2\) = 1 if the species is Gentoo and 0 otherwise.
\(H_0\) : \(\beta_4 = \beta_5 = 0\) (We could drop the interaction terms from our model in the presence of the other variables.)
\(H_A\) : At least one of the coefficients in \(H_0\) is not 0 (We cannot drop the interaction terms from our model in the presence of the other variables.)
penguins.noint <- lm(body_mass_g~bill_depth_mm + species, data=penguins)
anova(penguins.noint, penguins.int)
## Analysis of Variance Table
##
## Model 1: body_mass_g ~ bill_depth_mm + species
## Model 2: body_mass_g ~ bill_depth_mm * species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 338 44399670
## 2 336 42325191 2 2074479 8.2342 0.0003227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the F-statistic is significant, and the p-value is smaller than \(\alpha=0.05\), we reject \(H_0\) in favor of \(H_A\). In other words, we have enough evidence to conclude that the interaction terms between species and bill depth are necessary to keep in the model in the presence of the other variables.
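Again, the partial F statistic can be checked by hand from the ANOVA table:

```r
# F = [(RSS_no.interaction - RSS_interaction) / 2] / [RSS_interaction / 336]
F.stat <- (2074479 / 2) / (42325191 / 336)
round(F.stat, 4)  # matches the 8.2342 reported by anova()
```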
\(I_1\)= 1 if species is Chinstrap penguins and 0 otherwise, and \(I_2\)= 1 if species is Gentoo penguins and 0 otherwise, so plugging in these indicator values we get the estimated regression equations for each species of penguins as follows:
Chinstrap penguins:
\(\hat y = -283.28 + 217.15(x) + 247.06(1) - 175.71(0) - 12.53(x \cdot 1) + 152.29(x \cdot 0)\)
= \((-283.28 + 247.06) + (217.15 - 12.53)(x)\)
= \(-36.22 + 204.62x\)
Gentoo penguins:
\(\hat y = -283.28 + 217.15(x) + 247.06(0) - 175.71(1) - 12.53(x \cdot 0) + 152.29(x \cdot 1)\)
= \((-283.28 - 175.71) + (217.15 + 152.29)(x)\)
= \(-458.99 + 369.44x\)
Adelie penguins:
\(\hat y = -283.28 + 217.15(x) + 247.06(0) - 175.71(0) - 12.53(x \cdot 0) + 152.29(x \cdot 0)\)
= \(-283.28 + 217.15(x)\)
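The same per-species intercepts and slopes can be computed directly from the coefficient estimates (values copied from the summary above):

```r
# Order: intercept, bill_depth, Chinstrap, Gentoo, depth:Chinstrap, depth:Gentoo
b <- c(-283.28, 217.15, 247.06, -175.71, -12.53, 152.29)
chinstrap <- c(intercept = b[1] + b[3], slope = b[2] + b[5])
gentoo    <- c(intercept = b[1] + b[4], slope = b[2] + b[6])
rbind(Chinstrap = chinstrap, Gentoo = gentoo)
```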
par(mfrow = c(2, 2))
plot(penguins.int)
acf(penguins.int$residuals, main="ACF Plot of Residuals")
All of the assumptions appear to be met. This is clear because the residuals are scattered randomly around 0 in the residual plot with constant variance.
When looking at the ACF plot, we see quite a few significant lags, but this is because the data is sorted by species, and because Gentoo penguins have the largest body mass, the data is also somewhat sorted by weight.
We can conduct pairwise comparisons of the difference in mean body mass among all pairs of species, for given values of bill depth, by conducting three separate hypothesis tests. However, we need to be careful when conducting these tests to avoid inflating the chance of Type I errors. To ensure the chance of making any Type I error is not more than α, we can use the Tukey procedure.
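The need for adjustment is easy to quantify: with three unadjusted tests at α = 0.05, the familywise Type I error rate (assuming independent tests, for illustration) is already well above 0.05:

```r
# Probability of at least one Type I error across three independent tests
1 - (1 - 0.05)^3  # about 0.143
```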
pairwise<-multcomp::glht(penguins.noint, linfct = multcomp::mcp(species = "Tukey"))
summary(pairwise)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = body_mass_g ~ bill_depth_mm + species, data = penguins)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Chinstrap - Adelie == 0 13.38 52.95 0.253 0.965
## Gentoo - Adelie == 0 2238.67 73.68 30.383 <1e-05 ***
## Gentoo - Chinstrap == 0 2225.29 81.53 27.295 <1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
We can see that there is a significant difference in mean body mass, for given bill depth, for every pair of species except Chinstrap - Adelie, since both tests involving Gentoo are significant. The data support the claim that there is a significant difference in mean body mass between penguins of the Gentoo and Chinstrap species, and between the Gentoo and Adelie species, when controlling for bill depth. However, there is not a significant difference in mean body mass between penguins of the Chinstrap and Adelie species when controlling for bill depth, which is consistent with what we saw in the scatterplot from part 6b.
Given the positive values for the differences in the estimated coefficients, penguins of the Gentoo species have the highest body mass, followed by Chinstrap and then Adelie, when controlling for bill depth (though the Chinstrap - Adelie difference is not statistically significant).
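As an aside, base R's `TukeyHSD()` produces the same style of adjusted pairwise comparisons for a one-way layout; `multcomp::glht()` is still needed when adjusting for a covariate such as bill depth. A self-contained sketch on simulated data (group names and mean shifts are made up for illustration, not the penguin data):

```r
# Simulated one-way data: groups A, B, C, with C's mean shifted up by 3
set.seed(42)
g <- factor(rep(c("A", "B", "C"), each = 30))
y <- rnorm(90) + ifelse(g == "C", 3, ifelse(g == "B", 0.2, 0))
fit <- aov(y ~ g)
TukeyHSD(fit)  # adjusted CIs and p-values for B-A, C-A, C-B
```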
PAY, \(y\): average annual public school teacher salary, in dollars.
SPEND, \(x_1\): spending on public schools per student, in dollars.
AREA: region (North, South, West).
Table 1 below provides some summary statistics of the data:
| Region | n | Mean PAY | Mean SPEND |
|--------|----|----------|------------|
| North | 21 | \(\$24424\) | \(\$3902\) |
| South | 17 | \(\$22894\) | \(\$3274\) |
| West | 13 | \(\$26159\) | \(\$3919\) |
Table 1: Summary Statistics of Teacher Pay
Mean teacher pay seems to be highest in the West region and lowest in the South.
There seems to be a positive association between public school expenditure per student and mean teacher pay: regions that spend more per student also tend to pay teachers more on average.
A multiple linear regression model with teacher pay as the response and geographic area and public school expenditure (per student) as predictors can give further insight into the relationships between these variables, because we will be able to assess how region and expenditure jointly affect mean teacher pay. In particular, it will show whether there is a significant difference in mean pay between regions when holding expenditure constant.
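A sketch of how such a model could be fit in R. The data frame and its values here are simulated stand-ins (the raw teacher-pay data are not shown in this document), with coefficients chosen only for illustration:

```r
# Simulated stand-in for the teacher-pay data; values are hypothetical,
# chosen to roughly resemble the summary statistics in Table 1
set.seed(1)
n <- 51
AREA <- factor(sample(c("North", "South", "West"), n, replace = TRUE))
SPEND <- runif(n, 2500, 5000)
PAY <- 11600 + 3.3 * SPEND +
  530 * (AREA == "South") + 1670 * (AREA == "West") + rnorm(n, sd = 1500)
teacher <- data.frame(PAY, SPEND, AREA)

# Teacher pay modeled by spending plus region, with no interaction term
fit <- lm(PAY ~ SPEND + AREA, data = teacher)
coef(fit)
```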
Use the following info to answer the rest of question 7.
We want to see if geographic region and spending on public schools affect the average public school teacher pay. A regression with no interactions was fitted, i.e., \[E(y)=\beta_0 +\beta_1x_1 +\beta_2I_2 +\beta_3I_3,\] where \(I_2\) and \(I_3\) are the dummy codes for AREA: \(I_2 = 1\) if AREA = South, 0 otherwise, and \(I_3 = 1\) if AREA = West, 0 otherwise. The relevant R output is shown below:
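As a quick aside, the dummy coding R uses for a factor like AREA can be inspected directly with `model.matrix()`; a toy sketch:

```r
# Toy factor with the same levels as AREA; "North" is first alphabetically,
# so R makes it the baseline and creates dummy columns for South and West
area <- factor(c("North", "South", "West"))
mm <- model.matrix(~ area)
mm  # columns: (Intercept), areaSouth, areaWest
```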
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.160e+04 1.334e+03 8.690 2.43e-11 ***
SPEND 3.289e+00 3.176e-01 10.354 1.03e-13 ***
AREASouth 5.294e+02 7.669e+02 0.690 0.4934
AREAWest 1.674e+03 8.012e+02 2.089 0.0422 *
############################################
##Variance-Covariance matrix for beta hats##
############################################
(Intercept) SPEND AREASouth AREAWest
(Intercept) 1780535.6980 -393.5597348 -491859.07243 -2.381145e+05
SPEND -393.5597 0.1008967 63.18227 -1.870101e+00
AREASouth -491859.0724 63.1822716 588126.71689 2.442380e+05
AREAWest -238114.5499 -1.8701007 244238.02959 6.418738e+05
\(\hat{\beta}_2 = 529.4\). The estimated difference in the mean annual public school teacher salary between public schools in the South and North regions is \$529.40, for given spending on public schools per student. We interpret this as: the average annual public school teacher salary for public schools in the South region is estimated to be \$529.40 higher than in the North, when controlling for public school expenditure per student.
# Bonferroni CIs: beta_j +/- t_{1 - alpha/(2g), n-p} * se(beta_j), with g = 3 pairwise comparisons
p= 4
n= 51
t= qt(1-0.05/6, n-p)
b_2 = 529.4
se_b2 = 7.669e+02
cat("South - North: [", b_2 - (t*se_b2), ",", b_2 + (t*se_b2),"] \n")
## South - North: [ -1374.578 , 2433.378 ]
b_3 = 1674
se_b3 = 8.012e+02
cat("West - North: [", b_3 - (t*se_b3), ",", b_3 + (t*se_b3),"] \n")
## West - North: [ -315.1348 , 3663.135 ]
# South - West: contrast b_2 - b_3; se uses Var(b2) + Var(b3) - 2*Cov(b2, b3)
b2b3 = (529.4 - 1674)
se_b2b3 = sqrt(588126.71689+ 641873.8 - 2*244238)
cat("South - West: [", b2b3 - (t*se_b2b3), ",", b2b3 + (t*se_b2b3),"] ")
## South - West: [ -3282.493 , 993.2934 ]
All three CIs include 0, so at the 5% familywise level there is NOT a significant difference in the mean annual public school teacher salary between any pair of regions, when controlling for public school expenditure per student.
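The South - West standard error is a special case of the general contrast formula \(se(c^T\hat{\beta}) = \sqrt{c^T\,\widehat{\mathrm{Var}}(\hat{\beta})\,c}\). A sketch rebuilding it with matrix algebra from the variance-covariance entries printed above:

```r
# Covariance submatrix for (AREASouth, AREAWest), copied from the R output
V <- matrix(c(588126.71689, 244238.02959,
              244238.02959, 641873.8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("AREASouth", "AREAWest"),
                            c("AREASouth", "AREAWest")))
cc <- c(1, -1)                       # contrast: beta_2 - beta_3 (South - West)
se <- sqrt(t(cc) %*% V %*% cc)[1, 1]
se  # about 861.1, matching se_b2b3 computed above
```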